Abstract

Background: Gene duplication and horizontal gene transfer are common processes in bacterial and archaeal 
genomes, and are generally assumed to result in either diversification or loss of the redundant gene copies. 
However, a recent analysis of the genome of the soil bacterium Azotobacter vinelandii DJ revealed an abundance of 
highly similar homologs among carbohydrate metabolism genes. In many cases these multiple genes did not 
appear to be the result of recent duplications, or to function only as a means of stimulating expression by 
increasing gene dosage, as the homologs were located in varying functional genetic contexts. Based on these 
initial findings we here report in-depth bioinformatic analyses focusing specifically on highly similar intra-genome 
homologs, or synologs, among carbohydrate metabolism genes, as well as an analysis of the general occurrence of 
very similar synologs in prokaryotes. 

Results: Approximately 900 bacterial and archaeal genomes were analysed for the occurrence of synologs, both in 
general and among carbohydrate metabolism genes specifically. This showed that large numbers of highly similar 
synologs among carbohydrate metabolism genes are very rare in bacterial and archaeal genomes, and that the A.
vinelandii DJ genome contains an unusually large number of such synologs. The majority of these synologs were
found to be non-tandemly organized and localized in varying but metabolically relevant genomic contexts. The 
same observation was made for other genomes harbouring high levels of such synologs. It was also shown that 
highly similar synologs generally constitute a very small fraction of the protein-coding genes in prokaryotic 
genomes. The overall synolog fraction of the A. vinelandii DJ genome was well above the data set average, but not
nearly as remarkable as the levels observed when only carbohydrate metabolism synologs were considered. 

Conclusions: Large numbers of highly similar synologs are rare in bacterial and archaeal genomes, both in general 
and among carbohydrate metabolism genes. However, A. vinelandii and several other soil bacteria harbour large 
numbers of highly similar carbohydrate metabolism synologs which seem not to result from recent duplication or 
transfer events. These genes may confer adaptive benefits with respect to certain lifestyles and environmental 
factors, most likely due to increased regulatory flexibility and/or increased gene dosage.

Background

Genes sharing a common origin, without any further 
specification of their evolutionary relationship, are classi- 
fied as homologs, while paralogs and orthologs consti- 
tute subcategories of homologs. Orthologs evolve from a 
common ancestral gene via vertical descent (speciation), 
while paralogs evolve by duplication events taking place 
after speciation [1]. Apparent gene duplications may also 
result from horizontal gene transfer (lateral transfer of 
genetic material between species), and such homologs 
are classified as xenologs. 

Gene duplication was first developed as a coherent 
concept by Ohno more than 40 years ago [2], but the 
prevalence and importance of gene duplication was not 
clearly demonstrated until fully sequenced genomes be- 
came available. Estimates of numbers of duplicated 
genes in genomes of bacteria, archaea and eukarya have 
shown that large proportions of genes have been gener- 
ated by gene duplication in all three domains [3]. In eu- 
bacterial systems at least three, often partly overlapping, 
biologically relevant roles of gene duplication can be dis- 
tinguished: (a) to confer a solution to a selective prob- 
lem, (b) to facilitate further genetic adaptation, and (c) 
to give rise to genetic innovation and novel biochemical 
function [4]. 

Following a duplication event there are three possible 
fates for the duplicate genes: deletion, silencing (non- 
functionalization), or selection [2]. In cases of selection 
three major outcomes have been described: evolution of 
a new beneficial function in one duplicate (neofunctio- 
nalization), partitioning of ancestral functions between 
the duplicates (subfunctionalization), or conservation of 
ancestral function(s) in both duplicates (gene conserva- 
tion) [2,5-7]. It is however likely that paralog evolution 
may involve both subfunctionalization and neofunctio- 
nalization, either concurrently or sequentially [1]. There 
is also evidence for a mechanism where there is no "in- 
vention" of new gene function(s) or direct partitioning 
of ancestral functions, but rather amplification followed 
by improvement of extant weak or secondary functions 
in duplicate genes [8-10]. Most studies on the outcome 
of retained gene duplications have been carried out in 
eukaryotic systems, where gene and genome duplications 
are currently regarded as the major sources for development
of new functions. However, mechanisms for gene
duplication in plants and animals differ from those in
prokaryotes ([11,12] and references therein). In addition,
polyploidy and cellular differentiation play a major role
in paralog evolution in many
eukaryotes.

Gene duplications are among the most common muta- 
tions in bacteria, but show high intrinsic instability. Most 
tandem duplication events in bacterial genomes are associated
with naturally occurring repetitive sequences, such
as rRNA genes, transposable elements and other repetitive
elements, but gene duplication can also occur by pro- 
cesses capable of random end-joining in the complete ab- 
sence of any repetitive sequence [4]. Duplicated genes in 
bacteria appear to be created mainly by small-scale gene 
duplication events, and the majority of retained duplicated 
genes occur as single genes, not duplicated operons [13]. 
The total number of paralogs in microbial genomes has 
been shown to correlate with genome size [8]. 

Microbial genomes also acquire intra-genome homologs 
via horizontal gene transfer, leading to xenologs. For most 
bacterial and archaeal genomes there appears to be a high 
level of horizontal gene transfer events in general [14,15]. 
Xenologs can also be described as pseudoparalogs. It will
usually not be possible to discern between pseudoparalogs 
and true paralogs in a single-genome analysis [1]. Thus, 
Lerat et al. [15] have proposed the term "synologs" to de- 
scribe intra-genome homologs arising from either duplica- 
tion or horizontal transfer. The term was originally 
introduced by Gogarten [16] with a different definition, 
but the definition by Lerat et al. seems to be preferred and 
will be used in this study. For γ-proteobacteria (the class to which
Azotobacter vinelandii [see below] belongs) it has been
shown that horizontally acquired genes only rarely have 
pre-existing homologs within the recipient genome. But 
because synologs are rare in general, lateral gene transfer 
still contributes substantially to synology [15]. Further- 
more, subsequent duplications appear to be more com- 
mon among laterally transferred genes. This is probably 
due to selection for gene dosages, as recently imported 
genes are likely to have poorly optimized functions [17]. 

During annotation work on the genome of the meta- 
bolically versatile nitrogen-fixing soil bacterium A. vine- 
landii DJ [18], we observed that the genome of this 
organism appeared to have a high frequency of highly 
similar synologs among proteins involved in carbohy- 
drate metabolism. The 5.4 Mb genome contains approxi- 
mately 5000 protein-coding genes, out of which ~70% 
have been given a functional assignment [18]. To further 
investigate the abovementioned observation, 943 bacterial
and archaeal genomes that were available in the
SEED database [19] at the time of the study were ana- 
lysed for the occurrence of carbohydrate metabolism 
synologs. The general opinion is that in most cases, du- 
plicated genes will diverge over time. This is consistent 
with Hooper and Berg's observation that bacterial and 
archaeal genomes show roughly the same pattern of 
paralogs in groups of decreasing amino acid identity [8]. 
Thus, a high incidence of highly similar synologs could 
be explained either by a recent burst of duplications or 
repeated horizontal gene transfers, or by a mechanism 
for gene conservation operating on these specific genes. 
The latter would imply that retention of highly similar 
synologs benefits the organism adaptively. The results of 
this study support the view that sequence conservation of multiple
gene copies can be an adaptive strategy for bacteria to 
enhance their ability to cope with various environmental 
conditions. 

Results and discussion 

The A. vinelandii DJ genome contains abundant 
carbohydrate metabolism synologs, many of which 
display a high degree of sequence identity 

Annotation work on the genome of soil bacterium A. 
vinelandii DJ [18] revealed a high frequency of synologs 
among proteins involved in carbohydrate metabolism. 
To determine whether this is also the case for the closely 
related pseudomonads, and whether synologs are com- 
mon for other functional categories in these genomes, 
the occurrence of protein families in the genomes of A.
vinelandii DJ and 15 fully sequenced strains in the genus
Pseudomonas was investigated using OrthoMCL [20], 
which uses pairwise sequence alignments to identify groups 
of similar proteins (protein families) in proteomes. In this 
analysis, members of the same intra-genome protein fam- 
ily were regarded as synologs. The number of identified 
families was markedly higher in A. vinelandii DJ than in 
the pseudomonad strains with regard to proteins involved 
in both carbohydrate metabolism (Figure 1a) and carbohydrate
transport (Figure 1b), but not for any of the other
functional categories. In addition, carbohydrate related 
protein families with more than two members were more 
common in A. vinelandii (Figure 1). Another interesting 
outcome of this initial study was that more than half of 
the identified A. vinelandii protein families assigned to 
carbohydrate metabolism contained synologs that shared 
>90% protein sequence identity. 

Figure 1 Distribution of carbohydrate metabolism and transport protein families in A. vinelandii and pseudomonad genomes. The
number of intra-genome protein families (identified using OrthoMCL) assigned to the functional categories a) carbohydrate metabolism and
b) carbohydrate transport in the genomes of A. vinelandii DJ and 15 fully sequenced strains in the genus Pseudomonas. This illustrates that for
carbohydrate metabolism the number of synologs is clearly higher in the A. vinelandii genome compared to the Pseudomonas strains included in
this study. The total number of families for each genome is shown as stacked columns, with block patterning indicating the number of proteins
in the identified families. Members of the same intra-genome protein family were regarded as synologs.

An abundance of carbohydrate metabolism synologs is
uncommon in prokaryotic genomes

To investigate the occurrence of synologs among carbo- 
hydrate metabolism genes in a wider range of organisms, 
data was extracted from the SEED database [19]. One of 
the challenges in comparative genomics is that the ex- 
tent and quality of annotation, as well as the approaches, 
can vary widely between genomes. The SEED provides 
an environment in which all genomes have been sub- 
jected to the same annotation pipeline, with curation ex- 
ecuted across many genomes. Extracting data from this 
database can thus be expected to yield comparable gene 
sets for the included genomes. Sequences belonging to 
the "Carbohydrates" category (including genes involved 
in transport) were extracted for the 57 archaeal and 886 
bacterial genomes available in the SEED database at the 
time of the study. 

The SEED database sorts protein-coding sequences 
into FIGfams where all members are believed to have 
the same functional role(s). Sequences from the same 
genome belonging to the same FIGfam, regardless of 
level of sequence identity, were regarded as synologs. 
Out of the 943 genomes, 46 (i.e. 4.9%) were found not 
to contain any synologs among carbohydrate metabolism 
genes. The gene sets extracted from SEED included both 
functional genes and genes annotated as possible pseu- 
dogenes, but the gene sequences were translated before 
the data set was analysed. This ensured that pseudogene 
sequences containing frameshifts or deletions resulting 
in severely changed proteins would not be included in 
the identified sets of highly similar synologs, as the se- 
quence set was also checked for severely truncated proteins 
(see Methods). Analysing the data set with emphasis on 
total number of synologs and number of synolog groups 
(sets of two or more intra-genome synologs) yielded prac- 
tically indistinguishable qualitative results due to the strong 
positive correlation between these values (R² = 0.95 for a
linear regression). 
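
The counting scheme described above (same genome, same FIGfam) can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function name `synolog_stats`, the input format and the toy FIGfam identifiers are all hypothetical, and the only rule taken from the text is that two or more sequences from one genome assigned to the same FIGfam form a synolog group.

```python
from collections import defaultdict

def synolog_stats(assignments):
    """Summarise synologs for ONE genome.

    assignments: iterable of (gene_id, figfam_id) pairs (hypothetical format).
    Genes sharing a FIGfam within the genome are treated as synologs.
    """
    families = defaultdict(list)
    for gene_id, figfam_id in assignments:
        families[figfam_id].append(gene_id)

    # A synolog group is a FIGfam with two or more members in this genome.
    groups = {fam: genes for fam, genes in families.items() if len(genes) >= 2}

    n_genes = sum(len(genes) for genes in families.values())
    n_synologs = sum(len(genes) for genes in groups.values())
    # Synolog fraction: synologs relative to all genes in the set, in percent.
    fraction = 100.0 * n_synologs / n_genes if n_genes else 0.0
    return {
        "genes": n_genes,
        "synologs": n_synologs,
        "groups": len(groups),
        "fraction_pct": fraction,
    }

# Toy genome with made-up identifiers: FIG_A has 2 members, FIG_C has 3.
toy = [
    ("g1", "FIG_A"), ("g2", "FIG_A"),
    ("g3", "FIG_B"),
    ("g4", "FIG_C"), ("g5", "FIG_C"), ("g6", "FIG_C"),
]
stats = synolog_stats(toy)
print(stats)
```

Because every synolog group contributes at least two synologs, the number of synologs and the number of synolog groups rise together, which is in line with the strong correlation between the two values noted above.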

When all synologs were included (regardless of level of 
sequence identity), A. vinelandii DJ was found to contain
145 carbohydrate metabolism synologs distributed into 52
synolog groups. These synologs constituted 45.6% of the 
total carbohydrate metabolism genes in the genome, 
hereafter referred to as the synolog fraction. This is sig- 
nificantly higher than the median numbers found when 
analysing all 943 genomes (Table 1), and placed A. vine- 
landii DJ in the upper 9% when strains were ranked 
based on number of synolog groups. 

Different strains of the same species generally displayed 
similar synolog frequencies and levels of average synolog 
identity. Only 13% of the genomes contained more than 
46 (median + 2x MAD) synolog groups (Figure 2a) and 
synologs constituted more than 49.4% (median + 2x 
MAD) of carbohydrate metabolism genes in only 9% of 
the genomes (Figure 2b). 
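
The cutoffs quoted above follow directly from the reported medians and MADs: 20 + 2 × 13 = 46 synolog groups and 30.0 + 2 × 9.7 = 49.4%. The sketch below shows one way such a threshold could be computed. It assumes the raw (unscaled) median absolute deviation, which is consistent with these numbers but is not confirmed by the text, and the helper names and the toy list of per-genome counts are hypothetical.

```python
from statistics import median

def median_mad(values):
    """Return (median, raw median absolute deviation) of a sequence."""
    values = list(values)
    med = median(values)
    # Assumption: unscaled MAD, i.e. no 1.4826 normal-consistency factor.
    mad = median(abs(v - med) for v in values)
    return med, mad

def upper_cutoff(med, mad, k=2):
    """Threshold of the form median + k * MAD."""
    return med + k * mad

# Toy per-genome counts of synolog groups (made-up numbers).
toy_counts = [5, 12, 20, 33, 60]
med, mad = median_mad(toy_counts)
cutoff = upper_cutoff(med, mad)
n_above = sum(1 for c in toy_counts if c > cutoff)
print(med, mad, cutoff, n_above)

# Values reported in Table 1 reproduce the cutoffs quoted in the text.
print(upper_cutoff(20.0, 13.0))            # synolog groups
print(round(upper_cutoff(30.0, 9.7), 1))   # synolog fraction [%]
```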

Additionally, the results indicated that carbohydrate 
metabolism synologs with a high level of sequence identity 
are not a common phenomenon in bacterial and archaeal
genomes, as the average identity between synolog pairs 
was <50% in 96% of the 897 genomes which contained 
such synologs (Figure 2c). Furthermore, 24 out of the 35 
genomes with average sequence identity >50% contained 
fewer than ten synolog groups.

Large numbers of carbohydrate metabolism synologs 
with high identities are rare in prokaryotic genomes, and 
are not correlated with total number of protein-coding
genes

Based on the results from the initial investigations 
described above, the data set extracted from SEED was fil- 
tered to identify synolog groups with protein sequence 
identities >90%. Such synologs were found to be present 
in 374 (i.e. 40%) of the genomes. It should be noted that 
the average sequence identity between synolog pairs in 
this data set was ~97% (Table 1), which means that the
majority of the synologs have a considerably higher degree
of identity than the applied cutoff value. The high correl- 
ation between number of synologs and number of synolog 
groups was still present at this cutoff (R² = 0.97 for a linear
regression). As can be seen in Table 1, applying this cutoff
reduced the median number of synolog groups from 20 ±
13 to 0 ± 0, demonstrating that high frequencies of very
similar carbohydrate metabolism synologs are indeed rare.
This is also evident from Figure 3, which shows how the
number of highly similar synolog groups for carbohydrate
metabolism is distributed across genomes.

Table 1 Summary of statistical data for SEED data sets^a

                                                                All synolog pairs             >90% protein sequence identity
                                                                                              between synolog pairs
                                                                Median ± MAD   Min    Max     Median ± MAD   Min    Max
Number of carbohydrate metabolism synologs                      49.0 ± 35.0    0      394     0.0 ± 0.0      0      47
Number of carbohydrate metabolism synolog groups                20.0 ± 13.0    0      128     0.0 ± 0.0      0      16
Average protein sequence identity between synolog pairs^b [%]   36.8 ± 4.2     13.4   100.0   97.3 ± 1.8     90.0   100.0
Synolog fraction of carbohydrate metabolism genes [%]           30.0 ± 9.7     0.0    85.7    0.0 ± 0.0      0.0    34.6

^a Median, minimum and maximum values for the carbohydrate metabolism gene set extracted from 943 prokaryote genomes in the SEED database [19], with no
set cutoff and with a cutoff set at 90% protein sequence identity between synologs. Synologs are here defined as intra-genome sequences assigned to the same
FIGfam (see text). The synolog fraction describes the ratio of the total number of synologs relative to the total number of genes in a genome. MAD is median
absolute deviation. The median number of carbohydrate metabolism genes in the data set was 160.0 ± 74.0. The minimum and maximum numbers of carbohydrate
metabolism genes observed among the included genomes were 4 and 585 genes, respectively.
^b Calculated from the genomes containing carbohydrate metabolism synologs at the given cutoff.

Figure 2 Distribution of synolog groups, synolog fractions and average synolog sequence identity for carbohydrate metabolism genes.
Distribution of a) number of synolog groups, b) synolog fractions and c) average synolog sequence identity at the protein level for carbohydrate
metabolism synologs in a data set consisting of 943 [a)-b)] or 897 [c)] bacterial and archaeal genomes, illustrating that a high fraction of very
similar carbohydrate metabolism synologs is rare among the genomes included in this analysis. Synologs are here defined as intra-genome
sequences assigned to the same FIGfam in the SEED database [19]. The synolog fraction describes the ratio of the total number of carbohydrate
metabolism synologs relative to the total number of carbohydrate metabolism genes in a genome.

Figure 3 Occurrence of synologs with >90% identity among carbohydrate metabolism proteins in bacteria and archaea. Number of
carbohydrate metabolism synolog groups with internal protein sequence identity >90% identified in 943 investigated prokaryotic genomes.
Synolog groups are here defined as two or more intra-genome sequences assigned to the same FIGfam in the SEED database [19].

As could be expected, there was a positive correlation
between the number of carbohydrate metabolism synolog
groups in the genomes and the total number of carbohydrate
metabolism genes, total number of protein-coding
genes or genome size (R = 0.80, 0.56 and 0.51 respectively
for a linear regression). There was however no clear correlation
when only synologs displaying protein sequence
identities >90% were included (R² = 0.14, 0.15 and 0.13
respectively), so a high frequency of highly similar synologs
in a genome cannot simply be attributed to a large genome
or a large number of genes in general.
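
For readers who want to reproduce this kind of comparison on their own data, the following is a minimal sketch of the coefficient of determination for a simple linear regression with intercept, where R² equals the squared Pearson correlation. The function name `r_squared` and the toy numbers are hypothetical, and the sketch does not reproduce the values reported above.

```python
def r_squared(xs, ys):
    """R^2 of an ordinary least-squares line y = a + b*x.

    For simple linear regression with an intercept this equals the
    squared Pearson correlation between xs and ys.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # no variance in one variable: nothing to explain
    return (sxy * sxy) / (sxx * syy)

# Toy data (made up): e.g. genes per genome vs. synolog groups per genome.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
unrelated = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
print(perfect, unrelated)
```

A value near 1 means that gene count or genome size explains most of the variation in the number of synolog groups, while values such as those seen at the >90% cutoff indicate that it explains very little.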

The A. vinelandii DJ genome harbours an unusually large 
number of highly similar carbohydrate metabolism 
synologs 

When the threshold of minimum 90% protein sequence 
identity between synologs was applied, the A. vinelandii 
DJ genome was found to harbour 38 carbohydrate me- 
tabolism synologs distributed into 15 synolog groups 
(Table 2). At this sequence identity cutoff, 10 or more 
synolog groups were observed in only 1% of the analysed 
genomes (Figure 3). Thus, the results confirm that the A. 
vinelandii DJ genome does indeed harbour an unusually 
large number of highly similar carbohydrate metabolism 
synologs compared to the majority of sequenced bacterial 
and archaeal genomes. This observation also holds true if 
the analysed strains are ranked according to synolog frac- 
tions rather than number of synolog groups. 
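
A minimal sketch of how an identity threshold could be applied to intra-genome FIGfam members is given below. The exact grouping rule used in the study is not stated in this section, so the sketch makes an explicit assumption: two genes end up in the same highly similar synolog group if they share a FIGfam and are connected by a chain of pairs with at least 90% identity (single linkage). The function name, the input format and the toy gene labels are hypothetical.

```python
from collections import defaultdict

def high_identity_groups(figfam_of, pair_identity, cutoff=90.0):
    """Group same-FIGfam genes linked by pairs with identity >= cutoff.

    figfam_of:     dict gene_id -> figfam_id (one genome)
    pair_identity: dict (gene_a, gene_b) -> percent protein identity
    Assumption: single-linkage grouping, not necessarily the study's rule.
    """
    parent = {gene: gene for gene in figfam_of}

    def find(gene):
        # Union-find lookup with path halving.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for (a, b), identity in pair_identity.items():
        if figfam_of[a] == figfam_of[b] and identity >= cutoff:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for gene in figfam_of:
        clusters[find(gene)].append(gene)
    # Keep only groups of two or more genes; sort for stable output.
    return sorted(sorted(c) for c in clusters.values() if len(c) >= 2)

# Toy example with made-up labels and identities.
figfam_of = {"zwf1": "FIG_zwf", "zwf2": "FIG_zwf", "zwf3": "FIG_zwf",
             "tkt1": "FIG_tkt", "tkt2": "FIG_tkt"}
pair_identity = {("zwf1", "zwf2"): 97.0, ("zwf1", "zwf3"): 45.0,
                 ("zwf2", "zwf3"): 44.0, ("tkt1", "tkt2"): 92.5}
groups = high_identity_groups(figfam_of, pair_identity)
print(groups)
```

In this toy case the divergent third copy is excluded, leaving two groups with four synologs in total, which mirrors how the counts drop when moving from all synologs to those above the cutoff.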



The majority of highly similar carbohydrate metabolism 
synologs in A. vinelandii DJ belong to core metabolic 
processes 

SEED divides the "Carbohydrates" category into eleven 
subcategories, which enabled us to investigate the distri- 
bution of synologs in various parts of the carbohydrate 
metabolism. In the A. vinelandii DJ genome, all but two 
synolog groups occurred in the subcategories "Central 
carbohydrate metabolism" and "Fermentation" when the 
cutoff of 90% sequence identity was applied. In the 
"Central carbohydrate metabolism" subcategory, genes 
for glucose 6-phosphate dehydrogenase (zwf), transketolase
(tktA), glucose 6-phosphate isomerase (pgi), triosephosphate
isomerase (tpiA), 2,3-bisphosphoglycerate-independent
phosphoglycerate mutase (pgm), enolase (eno), pyruvate
kinase (pykA) and transaldolase (talB) were found to have
synologs with minimum 90% sequence identity, while
genes encoding electron transfer flavoprotein subunits
(etfA and etfB), electron transfer flavoprotein ubiquinone
oxidoreductase, acetaldehyde dehydrogenase (mhpF) and
4-hydroxy-2-oxovalerate aldolase (xylK/mphE) were found
to have such synologs in the "Fermentation" subcategory. 

Table 2 The genomes with the highest levels of very similar carbohydrate metabolism synologs^a

Genome                                        Total number of synolog groups
Soil/sediments
  Clostridium beijerinckii NCIMB 8052         15
  Azotobacter vinelandii DJ                   15
  Burkholderia xenovorans LB400               12
  Bacillus cereus                             10
  Nakamurella multipartita DSM 44233          9
  Rhodoferax ferrireducens DSM 15236          9
  Burkholderia cepacia R18194^b               9
  Nitrobacter hamburgensis X14                9
  Frankia sp. EAN1pec                         9
  Paracoccus denitrificans PD1222             8
  Ralstonia eutropha JMP134                   8
  Burkholderia vietnamiensis strain G4^b      8
Marine/aquatic
  Shewanella baltica OS155                    12
  Methylobacillus flagellatus KT              8
Pathogens
  Vibrio cholerae MZO-3^c                     16
  Shigella dysenteriae M131649                16
  Escherichia coli B7A^c                      12
  Streptococcus pneumoniae OXC141^c           11
  Escherichia coli E110019^c                  8
Commensals
  Streptococcus mitis NCTC 12261^c            9

Subcategory columns of the table: Central carbohydrate metabolism; Organic acids; Di- and oligosaccharides; Fermentation; One-carbon
metabolism; CO2 fixation; Aminosugars; Polysaccharides; Carbohydrates - no sub-category; Sugar alcohols; Monosaccharides.
[The per-subcategory counts are not legible in the source text.]

^a The table lists the twenty genomes with the largest number of synolog groups among carbohydrate metabolism genes when a threshold of at least 90% amino acid sequence identity was used. The data set was
extracted from the SEED database [19] and synologs were defined as intra-genome sequences assigned to the same FIGfam (see text). The total number of such synolog groups in these genomes as well as their
distribution in the eleven subcategories defined in the SEED database is shown. The median number of synolog groups for the genomes in this data set was 2.0 ± 1.0.
^b Opportunistic pathogen.
^c All synologs in this table were evaluated manually with regards to genomic context. The manual evaluation revealed that several of the synologs in V. cholerae MZO-3, E. coli B7A, S. pneumoniae OXC141, S. mitis NCTC
12261 and E. coli E110019 might be mistakenly identified as highly similar synologs due to overlapping contigs or the presence of truncated sequences. These sequences were therefore disregarded in interpretation
of the results.

For both subcategories the genes in a given synolog
group were found to be localized in partly or completely
different genomic contexts, and often as parts of differing
probable operons. In several cases, genes from two
or more synolog groups were localized in the same gene
cluster or putative operon (Figure 4). The majority of
the "Fermentation" subcategory synologs were clustered 
with genes involved in the utilization of aromatic com- 
pounds as energy sources, while the "Central carbohy- 
drate metabolism" synologs were always clustered with 
other carbohydrate metabolism genes; mainly genes in- 
volved in transport and degradation of various sugars. 
The varying, but metabolically relevant genomic con- 
texts indicate that these highly similar synologs are not 
the result of very recent duplications and/or repeated 
horizontal gene transfers, as adaptive selection resulting 
in favourable genomic arrangements appears to have 
taken place along with conservation of the duplicated 
sequences, even though the former can be assumed to 
be a very slow process [21,22]. 

It has been suggested that organizing genes in operons 
could be inherently suboptimal, as it may prevent fine-tuning
of expression, and that operons might exist despite
this disadvantage because they facilitate the evolution of 
co-regulation [23,24]. Thus, it is also possible that retain- 
ing highly similar synologs in different operonic contexts 
might be advantageous with regard to co-regulation while 
at the same time leaving more room for fine-tuning pro- 
duction of a particular protein. The localization of highly 
similar synologs in varying but metabolically relevant 
genomic contexts was also observed to be common 
among other genomes containing large numbers of such 
synologs. 

Figure 4 Genomic contexts for carbohydrate metabolism synologs in A. vinelandii DJ with >90% identity. The figure shows that the
highly similar A. vinelandii DJ carbohydrate metabolism synologs identified in this study are found in differing genomic contexts. Synologs are
here defined as intra-genome sequences assigned to the same FIGfam in the SEED database [19] and a threshold was set to include only
synologs displaying at least 90% protein sequence identity. Synolog groups are highlighted as same-coloured arrows. Striped arrows represent
other genes annotated as carbohydrate metabolism genes, checkered arrows represent genes annotated as aromatic compounds metabolism
genes, meshed arrows represent genes annotated as electron transport genes, white arrows represent genes annotated as belonging to other
functional categories, and dotted arrows represent genes annotated as encoding hypothetical proteins or proteins of unknown function. Genes
that had not been assigned a gene name in the annotated A. vinelandii DJ genome are marked with the number corresponding to their respective
gene IDs, which in the genome annotation are written in the form Avin##### [18].

Comparison to other A. vinelandii genomes supports
conservation of sequences

At the time of this study, A. vinelandii DJ was the only
published genome from the genus Azotobacter, making
it difficult to assess whether the occurrence and genomic
localization of highly similar synologs are similar in other
strains, which would support conservation as an important
factor. However, during the preparation of this paper
the genome sequence of A. vinelandii CA (also known 
as UW or OP) was published [25]. All the genome frag- 
ments depicted in Figure 4 were found to have 100% 
identical homologous sequences in strain CA. This is 
not surprising, as A. vinelandii DJ is a variant of strain 
CA generated in 1984 [18]. Another strain, A. vinelandii 
E, was shotgun sequenced as part of the A. vinelandii 
genome project resulting in the publication of the A. 
vinelandii DJ genome. Strain CA, and thus DJ, are la- 
boratory strains derived from A. vinelandii O [26] of 
which the earliest report dates back to 1952 [27], when 
it was part of the bacterial collection at the University of 
Wisconsin, Madison, US. Strain E was isolated from a 
soil sample collected in Trondheim, Norway in the
1960s.

Due to our participation in the A. vinelandii genome 
project we have access to the strain E shotgun sequence, 
which currently consists of 3947 contigs with lengths 
ranging from 100 bp to 57 378 bp. Analysis of the avail- 
able contigs showed that several of the above mentioned 
synologs are found in the same arrangements in this or- 
ganism, and with gene sequences nearly identical to 
those in A. vinelandii DJ. This also supports the view that these
highly similar A. vinelandii carbohydrate metabolism 
synologs did not originate from recent duplication/hori- 
zontal transfer events, as it is reasonable to assume that 
a considerable number of generations have passed since 
these two strains' last common ancestral cell. There is 
not much data available on how quickly, and by what 
mechanisms, free-living non-pathogenic bacteria evolve 
in their natural environments, but a recent study by 
Denef and Banfield [28] showed that a series of import- 
ant divergence events occurred over a time scale of years 
to decades in Leptospirillum biofilms in a mine drainage 
system. These included nucleotide substitutions, recom- 
bination and insertion/deletion events. 

We considered the possibility that the sequences of 
the abovementioned conserved proteins are so crucial 
for function that even minor changes would be deleteri- 
ous and thus heavily selected against. However, a com- 
parison of Zwf, TktA and PykA sequences from A. 
vinelandii, P. stutzeri, P. aeruginosa, P. fluorescens and 
E. coli revealed sequence identities as low as 40%, indi- 
cating considerable room for changes without loss of 
functionality. 
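
Sequence identity figures of this kind depend on how identity is counted over an alignment. The sketch below shows one common convention, assumed here for illustration only: identical residues divided by the number of aligned columns in which neither sequence has a gap. The alignment and scoring procedure used in the study is not described in this section, and the function name and the toy aligned fragments are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of a pairwise alignment.

    Assumption: columns containing a gap in either sequence are ignored,
    so the denominator is the number of residue-residue columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Toy aligned fragments (made up): 6 comparable columns, 4 identical.
identity = percent_identity("MKT-AYL", "MRTQAYI")
print(round(identity, 1))
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, which gives lower values for gapped alignments, so thresholds such as 90% are only comparable when the same definition is used throughout.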

The occurrence and distribution of highly similar 
carbohydrate metabolism synologs might be connected 
to lifestyle and environmental factors 

A. vinelandii is widely distributed in soil and water and 
is a very metabolically versatile species. The majority of 
the other bacteria with large numbers of highly similar
carbohydrate synologs (Table 2) are also found in soil,
sediments or water, but this could be due to a prevalence
of this kind of organism among sequenced bacterial genomes.
Gene duplication and divergence are however assumed
to be important for adaptation in environments
of scarce, but diverse nutrients, such as soil [29]. Even 
though the synologs studied here are highly similar it is 
possible that such carbohydrate metabolism synologs 
could confer advantages with regard to utilization of the 
available carbon sources, as independently regulated 
synologs might allow a higher degree of flexibility and 
efficiency in situations of simultaneous utilization of 
varying combinations of nutrients. As discussed above, 
the identified A. vinelandii DJ synologs were often found 
in varying, but metabolically relevant genomic contexts. 
The same was observed for the other soil bacteria in 
Table 2. Divergent expression of duplicate genes is well- 
known in eukaryotes, and expression divergence has 
been shown to be common even for highly similar gene 
pairs in yeast [30]. Duplications have also been shown to 
have a key role in evolution of gene regulatory networks 
in both E. coli and yeast [31]. In addition to regulatory 
advantages, the organisms could also gain benefits sim- 
ply from increased gene dosages. 

Table 2 lists the twenty genomes with the highest 
number of synolog groups among carbohydrate metabol- 
ism genes when the threshold of minimum 90% se- 
quence identity was applied, and shows the distribution 
of these synologs in the SEED database's eleven "Carbo- 
hydrates" subcategories. For most of the genomes in 
Table 2 the majority of the observed synolog groups are 
clustered in one or two of the subcategories, although 
which subcategory/-ies varies for the different strains 
(discussed below). This could indicate that in some or- 
ganisms, retention of highly similar carbohydrate metab- 
olism synologs is not a completely random process, but 
rather that certain genes are more prone to be retained 
and conserved than others due to selective pressure or 
other mechanisms. 

As can be seen in Table 2, highly similar carbohydrate 
metabolism synologs in bacterial genomes are by far 
most common in the subcategory "Central carbohydrate 
metabolism", as already observed for A. vinelandii. This 
could reflect adaptive benefits of increased gene dosage 
or regulatory flexibility in these central processes, al- 
though it could also be attributed to the fact that the 
central metabolism has been studied more extensively 
than peripheral processes, making it more likely that 
genes belonging to this subcategory will be well anno- 
tated in the genomes. 

However, clear links to environment and lifestyle could 
be found in some cases. It was observed that the major- 
ity of the highly similar synologs identified in the gen- 
ome of Nitrobacter hamburgensis X14, a bacterium 
which can oxidize nitrite while fixing carbon dioxide
(CO2) as its sole carbon source [32], occur in the subcat- 
egory "CO2 fixation". This is an important metabolic 
process in this species, as exposure to CO2 (or addition of 
sodium carbonate) has been shown to be required for 
optimum growth of the organism even in the presence of 
organic carbon [33]. Furthermore, the majority of highly 
similar synologs in the Bacillus cereus E33L genome were 
found among genes encoding myo-inositol catabolism 
proteins. B. cereus is found in soil, where myo-inositol is 
abundant, and one of the virulence factors of B. cereus 
E33L is a phosphatidyl-inositol-specific phospholipase [34].

All the genomes in Table 2 belong to bacteria, but simi- 
lar observations were also made for archaeal genomes, 
where the largest numbers of highly similar carbohydrate 
metabolism synologs were found in the genomes of 
Methanocella sp. RC-I (7 synolog groups), Methanosar- 
cina acetivorans C2A (6 synolog groups), Methano- 
coccoides burtonii DSM 6242 (5 synolog groups) and 
Methanosarcina mazei Go1 (5 synolog groups). These four
organisms use methanogenesis (a form of anaerobic respir- 
ation) as their energy-yielding metabolic process [35-38], 
and all the identified synologs were found among meth- 
anogenesis genes. Thus, there does appear to be a connec- 
tion between environmental factors or certain aspects of a 
prokaryotic organism's lifestyle and the types of genes 
which have highly similar synologs. This observation is 
also supported by earlier studies showing that retained 
paralogs can be beneficial for coping with environmental 
fluctuations [39] and that retained duplications are more 
common among genes involved in adaptation to environmental
conditions [40].

Presumably, at least some of the identified synologs in 
this analysis are the results of recent duplication and/or 
repeated horizontal transfer events. This seems to be the 
case in, for example, the Shigella dysenteriae M131649
genome, where the identified synologs were observed to 
be localized in a wider range of genomic contexts, and 
nearly always non-operonic and in close proximity to 
mobile genetic elements. 

Highly similar synologs are generally rare in bacterial and 
archaeal proteomes 

Finally, we wanted to get a broader perspective on the
occurrence of highly similar synologs overall, not just 
among carbohydrate metabolism genes or limited to 
SEED annotation. Therefore 897 bacterial and archaeal 
genomes available in the NCBI database at the time of 
the study were analysed for all intra-genome protein 
coding sequences (CDSs) showing at least 90% protein 
sequence identity along >90% of the full length of the 
compared sequences, using a strategy similar to the one 
employed by OrthoMCL. A set of two or more such homologs
was termed a synolog group. More commonly a
cutoff of 30% identity is used to identify homologous
protein sequences [41,42], but this study aimed to iden- 
tify only sequences with a very high level of identity. 
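The grouping step described above can be illustrated with a minimal sketch. It assumes that the pairs of proteins passing the identity and length thresholds are already available, and it merges them into groups by single-linkage clustering (connected components). The single-linkage rule and the function name `synolog_groups` are assumptions made for illustration; the text states only that the strategy was similar to the one employed by OrthoMCL.

```python
from collections import defaultdict


def synolog_groups(pairs, proteins):
    """Merge qualifying protein pairs into groups (single linkage).

    pairs    -- iterable of (protein_a, protein_b) tuples that already
                passed the identity and length thresholds
    proteins -- all protein identifiers in the genome
    Returns a list of sorted groups with two or more members.
    """
    # Union-find structure: every protein starts as its own root.
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)

    # Singletons are not synolog groups; keep sets of two or more members.
    return [sorted(g) for g in groups.values() if len(g) >= 2]
```

With this rule, two proteins can end up in the same group through a shared third member even if they do not pass the thresholds against each other directly, which may differ from the clustering used in the original analysis.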

While the total number of protein-coding sequences 
in the analysed genomes has an approximately symmetric 
distribution [see Additional file 1: Figure S1a], the distributions
of total number of synolog groups as well as synolog
fractions have large positive skews [see Additional file 1:
Figure S1b and c], i.e. the majority of the genomes
display very low values. Thus, the analyses showed that 
highly similar synologs generally comprise a very small 
fraction of the predicted proteomes of bacteria and ar- 
chaea (hereafter referred to as the synolog fraction). 

With the cutoff at 90% identity, the median synolog 
fraction (± median absolute deviation; MAD) was 1.9 ± 
1.2%. Only 40 out of the 897 analysed genomes had 
synolog fractions >10% at a cutoff of 90% identity. The 
synolog fractions of these 40 genomes are presented in 
[see Additional file 1: Table S1]. An interesting feature of
the 40 strains listed in [see Additional file 1: Table S1] is
that 78% are pathogenic bacteria. Thus, even taking into 
account that approximately one third of all sequenced 
genomes belong to bacteria that live in association with 
eukaryotes [43], there appears to be a bias towards this 
kind of organism among prokaryotic genomes with a
generally high frequency of highly similar synologs. 
There is however otherwise great variation between 
these 40 strains, as they encompass both plant and ani- 
mal pathogens, and intracellular as well as extracellular 
lifestyles. The majority of the genomes in [see Additional
file 1: Table S1] have previously been reported to be rich
in mobile genetic elements and/or DNA repeats [44-70], 
indicating a considerable contribution from such elements 
to very high levels of highly similar synologs overall, as 
could be expected. 
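The summary statistics used in this paragraph can be sketched as follows. The sketch assumes that the synolog fraction is the percentage of a genome's protein-coding sequences that belong to a synolog group, and that the MAD is reported unscaled (without the 1.4826 consistency factor). Neither detail is spelled out in the text, and the function names are hypothetical.

```python
import statistics


def synolog_fraction(n_synolog_proteins, n_total_proteins):
    """Percentage of the predicted proteome found in synolog groups."""
    return 100.0 * n_synolog_proteins / n_total_proteins


def median_and_mad(values):
    """Return (median, median absolute deviation) of a list of numbers.

    The MAD is left unscaled, which is an assumption of this sketch.
    """
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    return med, statistics.median(deviations)
```

Because the distribution of synolog fractions is strongly right-skewed, the median and MAD are less affected by the few genomes with very high fractions than the mean and standard deviation would be.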

The A. vinelandii DJ genome was found to have a 
synolog fraction of 4% at the given cutoff. While this 
does place A. vinelandii DJ in the top 20% when the 
analysed genomes are ranked according to synolog fraction,
it is far less extreme than the frequency observed
when only carbohydrate metabolism genes were consid- 
ered, showing that the exceptional abundance of carbo- 
hydrate metabolism synologs observed in this organism 
is not simply due to an unusually high level of highly 
similar synologs in general. 

Conclusions 

The soil bacterium A. vinelandii harbours an unusually 
large number of highly similar carbohydrate metabolism 
synologs (but not highly similar synologs in general) rela- 
tive to other sequenced bacterial and archaeal strains. The 
majority of these synologs occur in core metabolic pro- 
cesses. Highly similar carbohydrate metabolism synologs 
are generally rare in the genomes of prokaryotes, but are 
observed in some additional cases and there seems to be a
connection between lifestyle or environmental factors and 
the distribution of such synologs. The majority of ge- 
nomes with high levels of such synologs were from soil 
bacteria. In these genomes, many of the highly similar 
carbohydrate metabolism synologs were observed to be 
non-tandemly organized and localized in varying but func- 
tionally relevant genomic contexts, indicating that these 
synologs are not the result of recent duplications or re- 
peated transfer events despite having nearly identical pro- 
tein sequences. This supports the hypothesis that there 
are adaptive benefits in conservation of these highly simi- 
lar synologs, most likely due to a more flexible regulation 
of expression and/or increased gene dosages, although this 
is just one of several possible strategies for adaptation. 
Most likely, the initial "choice" between alternative strat- 
egies for a given situation is affected by the available 
resources (e.g. environmental nutrients, potential for hori- 
zontal genetic exchange), the cost (increased metabolic 
load, slower growth rate etc.) and the benefits (e.g. en- 
hanced nutrient utilization, more effective transition be- 
tween niches) of a given strategy for the organism. 

Methods 

A. vinelandii and Pseudomonas protein families 

Protein families containing two or more members were 
identified in the genomes of A. vinelandii DJ [GenBank: 
NC_012560], P. aeruginosa UCBPP-PA14 [GenBank: 
NC_008463], P. aeruginosa PA7 [GenBank:NC_009656], 
P. aeruginosa PAO1 [GenBank:NC_002516], P. entomophila
L48 [GenBank:NC_008027], P. fluorescens Pf0-1 [GenBank:
NC_007492], P. fluorescens Pf5 [GenBank:NC_004129],
P. mendocina ymp [GenBank:NC_009439], P. putida F1
[GenBank:NC_009512], P. putida KT2440 [GenBank: 
NC_002947], P. putida W619 [GenBank:NC_010501], 
P. putida GB-1 [GenBank:NC_010322], P. stutzeri A1501 
[GenBank:NC_009434], P. syringae pv. phaseolicola 1448a 
[GenBank:NC_005773], P. syringae pv. syringae B728a 
[GenBank:NC_007005] and P. syringae pv. tomato str. 
DC3000 [GenBank:NC_004578] using OrthoMCL. 
OrthoMCL is a genome-scale algorithm for grouping 
orthologous protein sequences, which can also be used 
to provide groups representing species-specific gene ex- 
pansion families [20]. The identified protein families 
were grouped into functional categories manually based 
on the available annotation information for the genes in 
each protein family. 

SEED data sets 

Data from 943 archaeal and bacterial genomes were ex- 
tracted from the SEED subsystem database [19] in October 
2010. Synologs were defined as intra-genome sequences 
assigned to the same FIGfam. The Networks-Based SEED 
API [71] was downloaded and used to extract synolog
nucleotide sequences contributing to the subsystems. The
sequences were translated and ClustalW [72] was run on 
all intra-genome synolog pairs. Protdist from the PHYLIP 
package [73] was then run on each synolog pair to calcu- 
late distances (i.e. sequence identity) using the similarity 
table option. This approach is more sensitive than the 
Blast approach described below, but also much slower. 
However, as we here compared only a relatively small set 
of sequence pairs, and not all vs. all as in the NCBI data 
set analysis, the lack of speed was not a limiting factor. 
Scripting and statistical analysis were carried out with local
tools written in Python [74]. This approach did not include 
a length requirement for the sequence alignments, but 
since large length differences were not prevalent among 
the compared sequence pairs [see Additional file 1:
Figure S2], this was not problematic for our analysis.
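As an illustration of the pairwise comparison step, the following minimal sketch computes percent identity from two already aligned protein sequences. It is not the Protdist implementation: the function name is hypothetical, and the choice to ignore alignment columns containing a gap is an assumption, since the text does not state how gaps were treated.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment.

    Both inputs must be rows of the same alignment and therefore have
    equal length. Columns where either sequence has a gap are skipped
    (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

Because gapped columns are skipped, this measure carries no length requirement, which mirrors the caveat noted above for the SEED analysis.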

NCBI data sets 

897 complete prokaryotic proteomes were downloaded 
from the NCBI ftp-server [75] in June 2009. For each 
proteome, a blastp [76] search against itself was performed 
and synologs were identified using 90% sequence identity 
thresholds along >90% of the full length of the compared 
sequences at the protein level. This is similar to the ap- 
proach used by OrthoMCL, but with stricter thresholds. 
At high identity thresholds, the difference between using 
similarity or identity is small. Maximum number of BLAST 
hits was set to 1000, unless the E-score from BLAST 
exceeded a threshold of 10" . 
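The filtering of self-against-self BLAST hits can be sketched as below. The sketch assumes tab-separated hits with the columns query ID, subject ID, percent identity, alignment length, query length and subject length, in that order. This column layout, the function name, and the interpretation of the length criterion (alignment length relative to the longer of the two sequences) are assumptions for illustration and are not taken from the original analysis.

```python
def passing_pairs(lines, min_identity=90.0, min_coverage=0.90):
    """Return the set of protein pairs that pass both thresholds.

    lines -- iterable of tab-separated hit lines with the assumed columns
             query ID, subject ID, percent identity, alignment length,
             query length, subject length
    """
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, length, qlen, slen = line.split("\t")[:6]
        if qseqid == sseqid:
            continue  # a protein always matches itself
        # Assumed criterion: the alignment must span more than 90% of the
        # longer sequence, and thereby of both sequences.
        coverage = int(length) / max(int(qlen), int(slen))
        if float(pident) >= min_identity and coverage > min_coverage:
            # Store each pair once, regardless of hit direction.
            pairs.add(tuple(sorted((qseqid, sseqid))))
    return pairs
```

The resulting pairs could then be merged into synolog groups, and the number of proteins in such groups divided by the proteome size to obtain the synolog fraction of a genome.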

Availability of supporting data 

The data supporting the results of this article are 
included within the article and its additional file. 